Multifont OCR Postprocessing System

Authors

  • Walter S. Rosenbaum
  • John J. Hilliard
Abstract

A series of techniques is being developed to postprocess noisy, multifont, nonformatted OCR data on a word basis to 1) determine whether a field is alphabetic or numeric; 2) verify that an alphabetic word is legitimate; 3) fetch from a dictionary a set of potential entries using a garbled word as a key; and 4) error-correct the garbled word by selecting the most likely dictionary word. Four algorithms were developed using a technique called vector processing (representing alphabetic words as numeric vectors) and by applying Bayes maximum likelihood solutions to correct the OCR output. The result was a software simulator that processed sequential fields generated by the Advanced Optical Character Reader (in use by the U.S. Postal Service in New York City), performed the four functions indicated above, and selected the correct alphabetic word from a dictionary of 62,000 entries.
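The abstract does not give the algorithms themselves, so the following is only a minimal sketch of how functions 3 and 4 might fit together. Everything in it is an assumption made for illustration: the letter values (a=1 through z=26), the use of vector magnitude and word length as the fetch key, the `tolerance` parameter, and the confusion probabilities are hypothetical and are not taken from the paper.

```python
import math

# ASSUMPTION: letters map to a=1 ... z=26. The paper's actual numeric
# assignment is not stated in the abstract. Input is assumed to be a-z only.
def word_vector(word):
    return [ord(c) - ord("a") + 1 for c in word.lower()]

def magnitude(word):
    # Euclidean length of the word's numeric vector.
    return math.sqrt(sum(v * v for v in word_vector(word)))

def fetch_candidates(garbled, dictionary, tolerance=8.0):
    # Hypothetical fetch step: keep same-length dictionary words whose
    # vector magnitude lies within `tolerance` of the garbled word's.
    target = magnitude(garbled)
    return [
        w for w in dictionary
        if len(w) == len(garbled) and abs(magnitude(w) - target) <= tolerance
    ]

# ASSUMPTION: illustrative values of P(observed char | true char).
# A real system would estimate these from reader output.
CONFUSION = {("c", "o"): 0.05, ("l", "i"): 0.04, ("u", "v"): 0.03}
P_CORRECT = 0.90   # probability a character is read correctly
P_OTHER = 0.001    # floor for confusions not listed above

def char_likelihood(observed, true):
    if observed == true:
        return P_CORRECT
    return CONFUSION.get((observed, true), P_OTHER)

def log_likelihood(garbled, candidate):
    # Characters are treated as independent, so log-probabilities add.
    return sum(
        math.log(char_likelihood(o, t)) for o, t in zip(garbled, candidate)
    )

def correct(garbled, dictionary, priors=None):
    # Select the candidate maximizing log P(garbled | word) + log P(word).
    # With no priors this is a plain maximum likelihood choice.
    candidates = fetch_candidates(garbled, dictionary)
    if not candidates:
        return None

    def score(word):
        prior = priors.get(word, 1.0) if priors else 1.0
        return log_likelihood(garbled, word) + math.log(prior)

    return max(candidates, key=score)

words = ["boston", "benton", "newark", "albany"]
print(correct("bcston", words))
```

In this sketch the magnitude filter only narrows the search; the final choice rests entirely on the character confusion probabilities, so the quality of the correction depends on how well those probabilities reflect the reader's real error behavior.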

Similar resources

An OCR System for Printed Documents

This paper describes the general structure of a fully automated document analysis system for printed documents. The system is based on a character preclassification stage which reduces the number of patterns to recognize and introduces a new contextual processing. This specific approach for multifont printed documents reading is based on pattern character redundancies. With the study of prototyp...

Multifont Classification using Typographical Attributes

This paper introduces a multifont classification scheme to help recognition of multifont and multisize characters. It uses typographical attributes such as ascenders, descenders and serifs obtained from a word image. The attributes are used as an input to a neural network classifier to produce the multifont classification results. It can classify 7 commonly used fonts for all point sizes from 7...

The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model

The techniques of image processing have been used in optical character recognition (OCR) for a long time. The recognition method evolved from early "pattern recognition" to "feature extraction" recently. The recognition rate is raised from 70% to 90%. But the character by character recognition technique has its limitation. Using language models to assist the OCR system in improving recognition ...

Improving OCR Performance in Biomedical Literature Retrieval through Preprocessing and Postprocessing

Today’s information retrieval (IR) techniques are mostly text-based. As a consequence, some types of information are beyond the reach of text-based IR systems, which fail in situations where textual information can not be easily accessed, e.g. textual information in biomedical images and figures. To tackle such situations, we propose to augment IR systems with the ability to perform optical cha...

Visual inter-word relations and their use in OCR postprocessing

A technique is presented that uses visual relationships between word images in a document to improve the recognition of the text it contains. This technique takes advantage of the visual relationships between word images that are usually lost in most conventional optical character recognition (OCR) techniques. The visual relations are defined to be the equivalence that exists between images of ...

Journal:
  • IBM Journal of Research and Development

Volume 19

Publication date: 1975